rand_distr: Add Zipf distribution #1136

vks · 2021-06-13T17:54:43Z

Fixes #1069.

vks · 2021-06-13T17:55:13Z

cc @kaimast

vks · 2021-06-13T17:55:57Z

I did not compare to the reference implementation in numpy yet.

dhardy

May be worth comparing this? https://jasoncrease.medium.com/rejection-sampling-the-zipf-distribution-6b359792cffa

rand_distr/src/zipf.rs

vks · 2021-06-14T11:54:20Z

> May be worth comparing this? https://jasoncrease.medium.com/rejection-sampling-the-zipf-distribution-6b359792cffa

On a first look, I'm not sure this method is different, but I would have to do the math to be sure.

vks · 2021-06-15T12:39:04Z

@dhardy The parametrization of the Zipf distribution you linked is different:

It uses the rank N and the exponent s >= 0, returning an integer from [1, N]. This follows Wikipedia.
The current implementation only has the exponent a >= 0 as a parameter and returns integers >= 1. This follows numpy.

Which parametrization should I implement?
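
To make the difference between the two parametrizations concrete, here is a minimal sketch that evaluates both probability mass functions. The function names `harmonic`, `zipf_pmf` and `zeta_pmf` are hypothetical and not part of rand_distr. The zeta normalization is only approximated (a partial sum plus an integral estimate of the tail), and the sketch assumes a > 1 so that the series converges.

```rust
/// Generalized harmonic number H_{n,s} = sum_{k=1}^{n} k^{-s}.
fn harmonic(n: u64, s: f64) -> f64 {
    (1..=n).map(|k| (k as f64).powf(-s)).sum()
}

/// N-s parametrization: P(k) = k^{-s} / H_{N,s} for k in 1..=N, zero elsewhere.
fn zipf_pmf(k: u64, n: u64, s: f64) -> f64 {
    if k < 1 || k > n {
        return 0.0;
    }
    (k as f64).powf(-s) / harmonic(n, s)
}

/// a parametrization: P(k) = k^{-a} / zeta(a) for every integer k >= 1.
/// zeta(a) is approximated by the first `m` terms plus the integral of
/// x^{-a} from m + 1/2 to infinity as an estimate of the remaining tail.
fn zeta_pmf(k: u64, a: f64) -> f64 {
    assert!(a > 1.0, "the normalizing series only converges for a > 1");
    if k < 1 {
        return 0.0;
    }
    let m = 10_000u64;
    let tail = (m as f64 + 0.5).powf(1.0 - a) / (a - 1.0);
    let zeta = harmonic(m, a) + tail;
    (k as f64).powf(-a) / zeta
}
```

With the same exponent, `zipf_pmf(k, n, s)` should approach `zeta_pmf(k, s)` as `n` grows, which is one way to compare the two choices numerically.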

dhardy · 2021-06-15T13:07:34Z

Sorry, I don't feel qualified to answer that.

vks · 2021-06-15T14:17:49Z

@dhardy Fair enough, I'm also not sure what's best without having a use case in mind.

@kaimast Any preferences? What's your use case for sampling the Zipf distribution?

vks · 2021-06-15T21:51:53Z

I also found an R library using the N-s parametrization.

The a parametrization is a limit of the N-s parametrization for N -> oo. So maybe N-s is a better choice? Or both should be supported?

Wikipedia calls the a parametrization the zeta distribution, so maybe we should do the same.
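
The limit relation mentioned above can be written out as follows, assuming the standard definitions of the two probability mass functions and writing $H_{N,s}$ for the generalized harmonic number. The limit only exists for $s > 1$, because the harmonic sum diverges otherwise.

```latex
P_{\mathrm{Zipf}}(k; N, s) = \frac{k^{-s}}{H_{N,s}},
\qquad H_{N,s} = \sum_{i=1}^{N} i^{-s},
\qquad k \in \{1, \dots, N\}

\lim_{N \to \infty} H_{N,s} = \zeta(s) \quad (s > 1)
\;\Longrightarrow\;
\lim_{N \to \infty} P_{\mathrm{Zipf}}(k; N, s) = \frac{k^{-s}}{\zeta(s)} = P_{\mathrm{Zeta}}(k; a = s)
```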

vks · 2021-06-16T00:48:06Z

@dhardy I chose to implement both distributions. Zeta is equivalent to numpy.random.zipf, Zipf is equivalent to the Java implementation in the blog post you linked. I had to derive the correct equations for the s = 1 special case.
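
One way to see where the s = 1 special case comes from is a sketch of the rejection-sampling hat function. It assumes a hat that equals 1 on [0, 1] and x^{-s} on [1, N], in the spirit of the approach in the linked blog post. The names `hat_area` and `inv_cdf` are illustrative and are not claimed to match the code in this PR. For s = 1 the integral of x^{-s} becomes a logarithm, so both formulas need a separate branch.

```rust
/// Area under the hat: 1 on [0, 1] plus the integral of x^{-s} over [1, n].
fn hat_area(n: f64, s: f64) -> f64 {
    if s != 1.0 {
        // 1 + (n^{1-s} - 1) / (1 - s) simplifies to (n^{1-s} - s) / (1 - s).
        (n.powf(1.0 - s) - s) / (1.0 - s)
    } else {
        // The integral of 1/x over [1, n] is ln(n).
        1.0 + n.ln()
    }
}

/// Maps p in [0, 1] to the point y in [0, n] such that the area under the
/// hat to the left of y equals p times the total area.
fn inv_cdf(p: f64, n: f64, s: f64) -> f64 {
    let pt = p * hat_area(n, s);
    if pt <= 1.0 {
        // Still inside the flat part of the hat.
        pt
    } else if s != 1.0 {
        // Solve 1 + (y^{1-s} - 1) / (1 - s) = pt for y.
        (pt * (1.0 - s) + s).powf(1.0 / (1.0 - s))
    } else {
        // Solve 1 + ln(y) = pt for y.
        (pt - 1.0).exp()
    }
}
```

A useful sanity check is that `inv_cdf(1.0, n, s)` returns `n` in both branches, and that `hat_area` is continuous in `s` around 1.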

saona-raimundo · 2021-07-23T17:59:08Z

Hi @vks!
@dhardy asked me if I could help reviewing this implementation, so I hope you don't mind some comments here and there :)

Looking at the previous comments you had quite a journey with names (and yeah, it is not very clear). But you got it right!

I have looked only at the implementation of Zeta for now. I added some suggestions on the documentation, but wanted to highlight that I have some questions about the sampling algorithm: I found the reference of the algorithm [Devroye86], so we have proved correctness :) and my only doubt is about floating point representability... Well, we will talk more in the comments, I think.
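
For readers without the book at hand, here is a minimal sketch of a rejection sampler for the zeta distribution in the style of the [Devroye86] algorithm, in the form also used by numpy's zipf sampler. It is a sketch under assumptions, not the code of this PR: `SplitMix64` is a small stand-in generator so that the example needs no external crates, and `sample_zeta` is a hypothetical name.

```rust
/// Stand-in pseudo-random generator (SplitMix64); not part of rand.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Uniform float in the half-open interval (0, 1].
    fn open_closed01(&mut self) -> f64 {
        ((self.next_u64() >> 11) + 1) as f64 / (1u64 << 53) as f64
    }
}

/// Draws one sample with P(k) proportional to k^{-a}, for a > 1.
/// The integer result is returned as a float.
fn sample_zeta(rng: &mut SplitMix64, a: f64) -> f64 {
    assert!(a > 1.0);
    let am1 = a - 1.0;
    let b = 2f64.powf(am1);
    loop {
        let u = rng.open_closed01();
        let v = rng.open_closed01();
        // Proposal: u in (0, 1] guarantees x >= 1.
        let x = u.powf(-1.0 / am1).floor();
        let t = (1.0 + 1.0 / x).powf(am1);
        // Rejection step; a NaN on the left-hand side (infinite x) is rejected.
        if v * x * (t - 1.0) / (b - 1.0) <= t / b {
            return x;
        }
    }
}
```

The floating point questions raised above show up directly in this sketch: `x` can exceed the range of exactly representable integers when `a` is close to 1, and `b` can overflow when `a` is large.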

Naming

I had a look at some references and leave a summary here. Maybe this is useful for the documentation, but I will leave it to you.

The general family of distributions is named Lerch [JKK05], after the Lerch zeta function. It is also called Hurwitz-Lerch [USB18].
There are many related distributions (at least 8 variants) [JKK05].
As you noticed, there are two variants (sometimes called the same): Zeta and Zipf.

Zeta distribution has support over all integers 1, 2, ...
Zipf distribution has support only over 1, 2, ..., N (it is simply the truncated version).

Other names for Zeta are: discrete Pareto, Riemann-Zeta, Zipf and Zipf–Estoup.
Special cases of Zipf distribution have names too: Estoup for s = 1 and Lotka for s = 2.

References

[JKK05] Section 11.2.20, page 526.
[Devroye86] Section 6.1, page 551.
[USB18] Chapter 3, Section 3.2.3, page 117.

saona-raimundo · 2021-07-28T09:21:53Z

Perfect! I understand the reasoning.

I would suggest saying something in the documentation about returning floats for integer random variables, but that deserves its own discussion and PR. This would involve investigating what is the effective domain of our implementations... information which is valuable for the user, but hard to get (or depends on parameters).
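
As a small illustration of the effective-domain concern: an f64 can represent every integer exactly only up to 2^53, and beyond that neighbouring integers collapse to the same float. The names `MAX_EXACT_F64_INT` and `next_integer_is_distinct` below are hypothetical helpers for this sketch only.

```rust
/// Largest n such that every integer in 0..=n is exactly representable as f64.
const MAX_EXACT_F64_INT: u64 = 1 << 53;

/// Returns true if `k` and `k + 1` map to different f64 values.
/// `k` must be below `u64::MAX` so that `k + 1` does not overflow.
fn next_integer_is_distinct(k: u64) -> bool {
    (k as f64) != ((k + 1) as f64)
}
```

A sampler that returns floats can therefore not distinguish between some adjacent integer outcomes once they exceed this bound.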

rand_distr/src/zipf.rs

vks · 2021-07-28T10:47:34Z

> I would suggest saying something in the documentation about returning floats for integer random variables, but that deserves its own discussion and PR. This would involve investigating what is the effective domain of our implementations... information which is valuable for the user, but hard to get (or depends on parameters).

There was some discussion in #987 and #1093, which resulted in some documentation in the Rand book.

saona-raimundo

These are the comments about Zipf.

I have some disagreements with some of the computations; I tried to explain the reasoning in each case. Feel free to ask for more clarification.

Notably, I also propose to introduce one more error variant NTooBig.

rand_distr/src/zipf.rs

saona-raimundo · 2021-07-28T12:14:17Z

> There was some discussion in #987 and #1093, which resulted in some documentation in the Rand book.

I see! Thanks, I did not know about it :)

rand_distr/src/zipf.rs

vks · 2021-08-02T22:40:08Z

Another case we might want to consider: For large a, the parameter b will be infinite, which results in all x being accepted. This is probably the best we can do, but we could instead return an error when Zeta is constructed.
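
A quick way to reproduce the overflow, assuming that b is computed as 2^(a - 1) and that the acceptance test is used in its division-free form (both are assumptions about the code under review, not quotes from it; `zeta_b` and `accepts` are hypothetical names):

```rust
/// Computes b = 2^(a - 1), the constant assumed to appear in the rejection criterion.
fn zeta_b(a: f64) -> f64 {
    2f64.powf(a - 1.0)
}

/// Division-free acceptance test. With b infinite, both sides become
/// infinite for v > 0, x > 0 and t > 1, so the comparison holds and the
/// candidate x is accepted.
fn accepts(v: f64, x: f64, t: f64, b: f64) -> bool {
    v * x * (t - 1.0) * b <= t * (b - 1.0)
}
```

For f64 the overflow starts at a = 1025, since 2^1024 is not representable, whereas a = 1024 still yields a finite b.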

rand_distr/src/zipf.rs

vks · 2021-08-03T21:58:16Z

Because this PR is already quite large, I would like to leave the extended distribution tests for a follow-up PR.

saona-raimundo

@dhardy we are finished with the review and agree on adding distributional tests in a subsequent PR.

dhardy

Great. On that basis I approve merging. And thanks for stepping in to review.

vks · 2021-08-04T10:36:30Z

@saona-raimundo Thanks for the extensive review!

@dhardy Should I squash before merging?

dhardy · 2021-08-04T14:25:43Z

@vks can't say I care too much whether or not it gets squashed. The reason I didn't merge myself is that I wasn't quite sure whether you were ready to.

dhardy reviewed Jun 14, 2021

dhardy mentioned this pull request Jun 14, 2021

Prepare rand 0.8.4 release, core 0.6.3, distr 0.4.1, pcg 0.3.1, hc 0.3.1 #1137

Merged

vks added 12 commits June 16, 2021 02:38

rand_distr: Add Zipf distribution

0360aa9

Update changelog

1e1e768

Zipf: Use OpenClosed01

a57247d

Zipf: Add benchmark

718e71b

Fix value stability tests

c2ecf1b

Rename Zipf to Zeta

6c27184

This follows the naming convention on Wikipedia.

Don't claim Zeta follows Zipf's law

b06c2f6

It seems like this is not true [1], at least not with the same exponent. [1] https://en.wikipedia.org/wiki/Zeta_distribution

rand_distr: Add Zipf (not zeta) distribution

a07b321

Zipf: Fix s = 1 special case

6270248

Also inline the distribution methods.

Zipf: Mention that rounding may occur

4d67af2

Zipf: Simplify trait bounds

139e898

Zipf: Simplify calculation of ratio

f514fd6

This should improve performance slightly.

vks force-pushed the zipf branch from 558177a to f514fd6 Compare June 16, 2021 00:41

vks added 3 commits June 16, 2021 02:49

Zipf: Update benchmarks

ccaa4de

Zeta: Inline distribution methods

3cccc64

Group Zeta and Zipf with rate-related distributions

14d55f8

Arguably, the Zipf distribution is related to the Pareto distribution [1]. [1] https://en.wikipedia.org/wiki/Zipf's_law#Related_laws

dhardy added D-review Do: needs review E-help-wanted Participation: help wanted labels Jul 8, 2021

dhardy mentioned this pull request Jul 22, 2021

Adding skew normal random variable #1149

Closed

saona-raimundo reviewed Jul 28, 2021

saona-raimundo reviewed Jul 28, 2021

vks added 6 commits July 28, 2021 17:39

Give credit for implementation details

e19349c

Zipf: Fix inv_cdf for s = 1

a746fd2

Zipf: Correctly calculate rejection ratio

b053683

Zipf: Add debug_assert for invariant

0f9243c

Zipf: Avoid division inside loop

e5aff9a

Zeta: Mention algorithm in doc comment

a32cd08

vks commented Jul 30, 2021

vks added 2 commits July 30, 2021 14:29

Zeta: Avoid division in rejection criterion

72a6333

Zeta: Fix infinite loop for small a

cf4b7e4

saona-raimundo reviewed Aug 3, 2021

saona-raimundo reviewed Aug 3, 2021

Zeta: Document cases where infinity is returned

fe5a6e1

saona-raimundo approved these changes Aug 4, 2021

vks requested a review from dhardy August 4, 2021 09:56

dhardy approved these changes Aug 4, 2021

vks removed the D-review Do: needs review label Aug 4, 2021

vks merged commit b39e35f into rust-random:master Aug 4, 2021

vks deleted the zipf branch August 4, 2021 14:33

saona-raimundo mentioned this pull request Sep 5, 2021

SkewNormal distribution implementation #1174

Merged

dhardy mentioned this pull request May 22, 2024

It is not reasonable for rand_distr::Zipf to return floating-point values #1323

Closed

dhardy mentioned this pull request Jun 20, 2024

Add distribution plots to rand_distr documentation #1434

Merged
